Introduction


This paper discusses a series of options for having Cubesat electronics monitor themselves for degraded
performance or pending failures due to on-orbit environmental conditions, and provide remediation
when possible. From case studies, engineering analysis, and testing, it is fairly well understood how
systems fail. One approach is to have a watchdog system keep an eye on the main system. But how
do we make the watchdog (or watch-cat) more reliable than the main program? One of the issues in
system failure is complexity, which grows with the number of parts and the number of
interactions between parts. This has been shown to be true in both hardware and software.


Another approach that works well is Triple Modular Redundancy (TMR). Here, we have three
identical systems and a trusted voter. This works because the probability of two independent units
failing at the same time is much lower than that of a single failure. But, in keeping with Murphy's law
(anything that can go wrong will go wrong, at the worst possible time), we need to characterize the
potential failures, figure out how to detect them, and define how the system might be recovered.
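
The voting step can be illustrated with a minimal sketch. It is written in Python for readability,
not as flight code, and the function names (tmr_vote, tmr_failure_probability) are hypothetical. It
assumes the three units fail independently and that the voter itself is perfect.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Majority-vote three redundant outputs.

    Returns (value, outvoted_index). outvoted_index is None when all three
    units agree, or 0-2 for the single unit that disagreed.
    """
    outputs = (a, b, c)
    value, count = Counter(outputs).most_common(1)[0]
    if count == 3:
        return value, None
    if count == 2:
        outvoted = next(i for i, v in enumerate(outputs) if v != value)
        return value, outvoted
    # All three differ: the voter cannot pick a winner.
    raise RuntimeError("no majority: all three units disagree")


def tmr_failure_probability(p):
    """Probability that at least two of three independent units fail,
    each with failure probability p (assumption: the voter never fails)."""
    return 3 * p**2 * (1 - p) + p**3
```

With p = 0.01 per unit, the triplicated system fails with a probability of about 0.0003, but only
under the independence assumption stated above.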


This paper will focus on an approach that is software-based and runs on the system it is testing.
From formal testing results, and with certain key engineering tools, we can come up with likely failure
modes and possible remediations. Besides self-test, we can have cross-checking of systems. Not
everything can be tested by the software without some additional hardware. First we will discuss the
engineering analysis that will help us define the possible hardware and software failure cases, and then
we will discuss possible recovery actions and remediations. None of this is new, but the suggestion is to
collect together best practices in the software testing area, develop a library of Rad-Hard
Software (RHS) routines, and get operational experience. These routines should be open source.


Another advantage of the software approach is that we can change it after launch, as more is learned, 
and conditions change. 


This paper explores an approach to detect and respond to pending radiation damage to flight computers.
We will focus on Cubesat missions, but the techniques are applicable across a wide spectrum of critical
real-time embedded systems. Our primary target platform will be the 32-bit ARM architecture, such as
found in the Raspberry Pi. We will also consider the development of an Open Source software package
for possible inclusion in the Core Flight System, NASA-GSFC, Code 582, Flight Software.


The major problem for Spaceflight computers is radiation, although there are other environmental
issues, and there can always be hardware and software residual errors that made it through testing. We
will discuss radiation first.


Radiation Hardness Issues for Space Flight Applications 


A complete discussion of the physics of radiation damage to semiconductors is beyond the scope of this 
document. However, an overview of the subject is presented. The tolerance of semiconductor devices 
to radiation must be examined in the light of their damage susceptibility. The problems fall into two 
broad categories, those caused by cumulative dose, and those transient events caused by asynchronous 
very energetic particles, such as those experienced during a period of intense solar flare activity. The 
unit of absorbed dose of radiation is the rad, representing the absorption of 100 ergs of energy per gram 
of material. A kilo-rad is one thousand rads. At 10k rad, death in humans is almost instantaneous. One 
hundred kilo-rad is typical in the vicinity of Jupiter's radiation belts. Ten to twenty kilo-rad is typical 
for spacecraft in low Earth orbit, but the number depends on how much time the spacecraft spends 
outside the Van Allen belts, which act as a shield by trapping energetic particles. 


Absorbed radiation can cause temporary or permanent changes in the semiconductor material. Usually 
neutrons, being uncharged, do minimal damage, but energetic protons and electrons cause lattice or 
ionization damage in the material, and resultant parametric changes. For example, the leakage current 
can increase, or bit states can change. Certain technologies and manufacturing processes are known to 
produce devices that are less susceptible to damage than others. Substrate materials such as diamond
or sapphire help to make the device more tolerant of radiation, but are much more expensive.


Radiation tolerance of 100 kilo-rad is usually more than adequate for low Earth orbit (LEO) missions 
that spend most of their life below the shielding of the Van Allen belts. For polar missions, a higher
total dose is expected, from 100k to 1 mega-rad per year. For synchronous, equatorial orbits, which are
used by many communication satellites and some weather satellites, the expected dose is several
kilo-rad per year. Finally, for planetary missions to Venus, Mars, Jupiter, Saturn, and beyond, requirements
that are even more stringent must be met. For one thing, the missions usually are unique, and the cost 
of failure is high. For missions towards the sun, the higher fluence of solar radiation must be taken into 
account. The larger outer planets, such as Jupiter and Saturn, have their own large radiation belts 
around them as well. The radiation environment at Jupiter is thousands of times stronger than at Earth. 


Cumulative radiation dose causes a charge trapping in the oxide layers, which manifests as a parametric 
change in the devices. Total dose effects may be a function of the dose rate, and annealing of the device 
may occur, especially at elevated temperatures. Annealing refers to the self-healing of radiation induced 
defects. This can take minutes to months, and is not applicable for lattice damage. The internal 
memory or registers of the cpu are the most susceptible area of the chip, and are usually deactivated 
for operations in a radiation environment. The gross indication of radiation damage is the increased 
power consumption of the device, and one researcher reported a doubling of the power consumption at 
failure. In addition, failed devices would operate at a lower clock rate, leading to speculation that a key
timing parameter was being affected in this case.


Single event upsets (SEUs) are the response of the device to direct high energy isotropic flux, such as
cosmic rays, or the secondary effects of high energy particles colliding with other matter (such as
shielding). Large transient currents may result, causing changes in logic state (bit flips), unforeseen
operation, device latch-up, or burnout. The transient currents can be monitored as an indicator of the
onset of SEU problems. After an SEU, the effects on the operation of the processor are unpredictable.


Mitigation of problems caused by SEUs involves self-test, memory scrubbing, and forced resets.


The LET (linear energy transfer) is a measure of the incoming particles' delivery of ionizing energy to
the device. Latch-up refers to the inadvertent operation of a parasitic SCR (silicon controlled rectifier),
triggered by ionizing radiation. In the area of latch-up, the chip can be made inherently hard through use
of the epitaxial process for fabrication of the base layer. Even the use of an epitaxial layer does not
guarantee complete freedom from latch-up, however. The next step generally involves a Silicon on
Insulator (SOI) or Silicon on Sapphire (SOS) approach, where the substrate is totally insulated, and
latch-ups are not possible. This is an expensive approach.


In some cases, shielding is effective, because even a few millimeters of aluminum can stop electrons 
and protons. However, with highly energetic or massive particles (such as alpha particles, helium 
nuclei), shielding can be counter-productive. When the atoms in the shielding are hit by an energetic 
particle, a cascade of lower energy, lower mass particles results. These can cause as much or more 
damage than the original source particle. 


Cumulative dose and single events 


The more radiation that the equipment gets, in low doses for a long time or in high doses for a shorter
time, the greater the probability of damage. The Total Ionizing Dose (TID) accumulates over time,
and actually displaces the semiconductor lattice structure. It causes shifts in the device threshold
voltage, and noticeably increased current draw. The damage can become permanent. TID becomes less of a
concern as technology advances, devices become smaller, and the gate oxides become thinner. The
higher the voltage, though, the more problematic the effect can be. Analog to digital converters can
experience conversion shifts.


Single events are caused by high energy particles, usually protons, that disrupt and damage the
semiconductor lattice. The effects can be upsets (bit changes) or latch-ups (bit stuck). The damage can
“heal” itself, but it is often permanent. Most of the problems are caused by energetic solar protons,
although galactic cosmic rays are also an issue. Solar activity varies, but is now monitored by sentinel 
spacecraft, and periods of intensive solar radiation and particle flux can be predicted. Although the Sun 
is only 8 light minutes away from Earth, the energetic particles travel much slower than light, and we 
have several days warning. During periods of intense solar activity, Coronal Mass Ejection (CME) 
events can send massive streams of charged particles outward. These hit the Earth’s magnetic field and 
create a bow wave. The Aurora Borealis or Northern Lights are one manifestation of incoming charged 
particles hitting the upper reaches of the ionosphere. 


Cosmic rays, particles and electromagnetic radiation, are omni-directional, and come from extra-solar 
sources. Most of them, 85%, are protons, with gamma rays and x-rays thrown in the mix. Energy levels 
range to 10° to 10° electron volts (eV). These are mostly filtered out by Earth’s atmosphere. There is no 
such mechanism on the Moon, where they reach and interact with the surface material. Solar flux
energies range to several billion electron volts (GeV).


Single Event Upsets (SEUs) are instantaneous events, caused by highly energetic particles such as
cosmic rays. They cause momentary bit flips, but are generally not cumulative. Some events may
require a reset to effect recovery of state.


There is a minimum of data for radiation effects on the popular ARM architectures, used in the
Raspberry Pi and Arduino, and this should be addressed immediately, as they are the architectures of
choice for Cubesats. For emerging architectures such as the Intel Edison, there is no radiation tolerance
data.


Mitigation Techniques 


The effects of radiation on silicon circuits can be mitigated by redundancy, the use of specifically 
radiation hardened parts, Error Detection and Correction (EDAC) circuitry, and scrubbing techniques. 
Hardened chips are produced on special insulating substrates such as Sapphire. Bipolar technology 
chips can withstand radiation better than CMOS technology chips, at the cost of greatly increased 
power consumption. Shielding techniques are also applied. In error detection and correction techniques, 
special encoding of the stored information provides a protection against flipped bits, at the cost of 
additional bits to store. Redundancy can also be applied at the device or box level. 
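
As an illustration of the encoding idea, the sketch below uses a Hamming(7,4) code: four data bits are
stored with three check bits, and any single flipped bit in the seven-bit word can be located and
corrected. This particular code and the function names are assumptions chosen for the example; the
text above does not prescribe a specific EDAC code, and flight memories typically use wider codes
implemented in hardware.

```python
def hamming74_encode(nibble):
    """Encode a 4-bit value as a list of 7 bits (codeword positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]   # d[0] is the least significant bit
    code = [0] * 8                              # index 0 unused, positions 1..7
    code[3], code[5], code[6], code[7] = d      # data bits
    code[1] = code[3] ^ code[5] ^ code[7]       # parity over positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]       # parity over positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]       # parity over positions 4,5,6,7
    return code[1:]


def hamming74_decode(bits):
    """Return (nibble, syndrome). A non-zero syndrome is the 1-based position
    of the single flipped bit, which is corrected before the data is extracted."""
    code = [0] + list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                     # XOR of the positions of all set bits
    if syndrome:
        code[syndrome] ^= 1                     # flip the faulty bit back
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, syndrome
```

The cost is visible in the word length: seven stored bits for every four bits of data. A code of this
kind corrects one flipped bit per word; two flips in the same word would be miscorrected.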


Another area of potential problems is thermal. The cpu can monitor its own temperature, as there is
usually a dedicated diode junction hooked to an A/D port that can be read to see the cpu's internal
temperature.


Engineering Tools 


Various system engineering tools are available to use both before and after “incidents.” Ideally, we
review the systems beforehand with respect to failure and safety, and factor these issues into the
implementation plan. Many factors marginalize this approach, including management focus, and
impact on systems cost and schedule.


After the fact, we have the Root Cause analysis method, so we can determine what exactly failed, and 
how this particular issue can be addressed and mitigated. In many cases, this process uncovers other 
latent issues that also need to be addressed. These need to be documented as case studies for the benefit 
of future projects. 


Root Cause Analysis 


Root Cause Analysis refers to an engineering process to identify and categorize the causes of events,
and to identify the primal cause. It is a useful tool for determining why a disaster happened. It is used to
define the what, how, and why (and sometimes, who). Its value is that it will lead to a definition of
corrective measures that can be applied in the future.


By definition, root causes are underlying, identifiable, and controllable. The RCA process includes a 
data collection phase (forensics), a cause charting, the root cause identification, recommendations, and 
implementation of the solution to avoid repeating the error. In many cases, the RCA will uncover other 
failure causes that were overlooked. There are software tools available to assist in the RCA process. 


Keep in mind, your solution should not introduce additional problems. 


FMEA 


The failure modes and effects analysis is an engineering tool that is applied during the design and 
testing process of a system. In this approach, we postulate failure modes, and analyze their impact on 
the system performance. The possible failure modes are examined to confirm their validity. Then, the 
possible failures are prioritized by severity and consequences. The goal is to identify and eliminate 
failures in the order of decreasing severity. 


The FMEA approach can actually start at the project conceptual phase, and continue throughout the
project life-cycle. It can (and should) be applied to modifications to existing projects. The FMEA
approach originated with the U.S. military during World War II. After the war, the approach was
adopted by the aviation (aerospace) and automotive industries.


The FMEA analysis requires a cross-functional team, consisting, as applicable, of hardware and
software engineers, manufacturing, Quality Assurance, test engineers, reliability engineers, parts, and,
ideally, the customer.


The process involves identifying the scope of the project, defining the boundaries and the desired level 
of detail. Then, the system (or project) functions are identified. Each function is analyzed to identify 
how it could fail. For each of these failure cases, the consequences are noted. These range from no 
effects to catastrophic. Formally, the consequences are rated on a scale of 1 to 10, with 1 being
insignificant and 10 catastrophic. The root cause is then determined for each consequence, starting with
the 10's. Software tools are available to support this analysis process.


Once the causes are determined, the controls are defined. Controls prevent the cause from happening, 
reduce the probability of happening, or detect the failure in time for correction to be applied. For each 
control, then, a detection probability rating is calculated (or estimated), again on a scale of 1-10. Here, 
1 indicates that control is certain, and 10 indicates that the solution will not work. By definition, critical 
characteristics of the system have a severity of 9 or greater, and have an occurrence and detection 
rating of greater than 3. 


A Risk Priority Number (RPN) is calculated, which is severity times occurrence times detection
(ratings). This measure is used to rank failure modes in the order in which they are to be addressed. Of
course, some of these rankings are not measurable, but the result of good engineering guesses. From an
FMEA, you can develop contingency plans, adjusted to the identified failure scenarios. It is always good
to have a Plan B. Also C, D, E...
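
The RPN arithmetic described above is simple enough to sketch directly. The failure modes and
ratings in the usage note are invented purely for illustration, and the function names are
hypothetical.

```python
def risk_priority_number(severity, occurrence, detection):
    """RPN = severity x occurrence x detection, each rated on a 1-10 scale."""
    for rating in (severity, occurrence, detection):
        if not 1 <= rating <= 10:
            raise ValueError("ratings must be in the range 1..10")
    return severity * occurrence * detection


def rank_failure_modes(modes):
    """Sort (name, severity, occurrence, detection) tuples by decreasing RPN."""
    return sorted(modes,
                  key=lambda m: risk_priority_number(m[1], m[2], m[3]),
                  reverse=True)
```

For example, with illustrative ratings of (6, 8, 3) for a memory bit flip, (9, 3, 4) for a latch-up,
and (5, 2, 2) for a thermal excursion, the RPNs are 144, 108, and 20, so the bit flip is addressed
first even though the latch-up has the higher severity.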


Fault Tolerant Design 


In this design approach, hardware or software is designed to continue to operate properly in the event
of one or more failures. It is sometimes referred to as graceful degradation. There is, of course, a limit
to the number of faults or failures that can be handled, and the faults or failures may not be
independent. Sometimes, the system can be designed to degrade, but not fail, as a result of the fault.
Fault recovery in a fault-tolerant design is either roll forward or roll back. Roll back refers to returning
the system state to a previous check-pointed state. Roll forward corrects the current system state to
allow continuation.
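
The two recovery styles can be contrasted in a small sketch. The class and method names are
hypothetical, and the state is a plain Python dictionary standing in for whatever the flight software
would actually checkpoint.

```python
import copy


class CheckpointedState:
    """Holds a working state plus the last known-good checkpoint."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        """Record the current state as known-good."""
        self._checkpoint = copy.deepcopy(self.state)

    def roll_back(self):
        """Discard the current state and return to the last checkpoint."""
        self.state = copy.deepcopy(self._checkpoint)

    def roll_forward(self, repair):
        """Keep going from the current state after applying a repair function."""
        self.state = repair(self.state)
```

Roll back loses any work done since the checkpoint; roll forward keeps it, but depends on a repair
function that knows how to correct the damaged state.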


The current Flight Computer of choice for traditional space missions is the RAD-750. It is
radiation-hardened. Its specification is not more than one upset in 15 years. An upset is defined as a situation
requiring intervention from the control center. That is very inconvenient when your computer has just
gone past Pluto. The classic case of interplanetary debugging was done from JPL on the Mars
Pathfinder spacecraft on the surface of Mars, and found a priority inversion issue that did not arise during
test. Part of the reason this was possible was that certain test features were left in the flight code. This
included a trace/log module. Without this, identifying the error would not have been possible.


Redundancy 


Redundancy refers to the technique of having multiple copies of critical components. Belt and
suspenders; two independent ways to accomplish the same results. Either can fail without affecting the
other. This can refer to hardware or software. This increases the reliability of the system. Redundant
units can be deployed in parallel, such as extra memory, where each single unit can handle the load.
This provides what is referred to as a margin of safety. We can also consider using the memory
management unit to move things around in physical memory, to avoid damaged areas.


In certain systems that are responsible for safety-critical tasks, we might triplicate the critical portion,
which reduces the probability of system failure to small, acceptable levels. This approach is found in
aircraft controls, nuclear power plant controls, and many more safety-critical systems.


Of course, if there is a common error in the three units, we have not increased our reliability. This 
situation is referred to as a common mode or single point error. Another problem is in the voting logic, 
that makes the decision that an error has been made, and switches controllers. At least one satellite 
launch failed because the voting logic made the wrong choice. Redundancy carries penalties in size, 
weight, power, cost, and testing complexity. 


Fault isolation allows the system to operate around the failed component, using backup or alternative 
modules. Fault containment strives to isolate the fault, and prevent propagation of the failure. 


Systems can be designed to be fail-safe, fail-soft, or can be “melt-before-fail.” The more fault tolerance 
that is built into a system, the more it will cost, the slower it will operate, and the more difficult it will 
be to test. It is important not to increase the complexity to the point where the system is not testable, 
and is “designed to fail.” 


Fault Tree 


A fault tree is an engineering tool to provide a graphical representation of faults and their causes. It 
provides a top-down deductive analysis for a system. There are software tools to facilitate the 
construction of the tree, which allows you to visualize binary decision points. It must be complete (full 
fault coverage) and correct. 


Fault Tolerance 


Fault tolerance refers to the feature of a system that allows for certain faults or certain sequences of
faults to not affect operation or safety. The system might be said to be capable of graceful degradation.
Obviously, there is a limit to how many faults the system can survive, and many real faults cause
subsequent ripple effects. Loss of attitude control causes a shortage of power, which causes
communication to be lost, which means remote debugging and remedial action can't be taken.


A defective design can be identified and corrected before it leads to an accident thanks to quality 
assurance and analysis of flight parameters during testing and operation of any space hardware. This is 
where complacency comes into play. Failures involving complex systems are always preceded by so- 
called accident precursors, which take the form of parameters out of tolerance. People responsible for 
those systems become accustomed to the idea that nothing bad will ever happen because nothing bad 
has happened yet. Out of tolerance or limits can be sensed. This includes current draw, temperature, 
time-to-complete, etc. 
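
Sensing an out-of-limits parameter can be as simple as the following sketch. The yellow/red
two-level scheme, the parameter names, and the threshold values are all assumptions made for
illustration; real limits would come from test data for the specific hardware.

```python
def classify(value, yellow_limit, red_limit):
    """Classify one reading against its upper yellow and red limits."""
    if value >= red_limit:
        return "red"
    if value >= yellow_limit:
        return "yellow"
    return "nominal"


def check_telemetry(sample, limits):
    """Classify every monitored parameter in a telemetry sample.

    sample: {name: reading}; limits: {name: (yellow_limit, red_limit)}.
    """
    return {name: classify(sample[name], yellow, red)
            for name, (yellow, red) in limits.items()}
```

A yellow result is the accident precursor: nothing has failed yet, but the reading should be logged
and trended rather than ignored.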


Fault Containment 


Fault Containment means we might have a fault, but it can be contained or isolated locally. This
minimizes the impact of the fault, and keeps it from getting worse. Opportunities for fault containment
are in the design phase, after a good FMEA has been done. Then, the systems can be analyzed with a
goal of minimizing subsequent faults and failures. If a memory unit has excessive problems, we can
map it out of the memory space, using the MMU.


Mitigation


Fault mitigation means you have designed the system with what we might call reflex actions in
response to a detected fault, and healing actions, that offset the failure. This can't be done across all
faults, but an evaluation of the system with various what-if scenarios, may allow for automatic
responses to certain faults and failures.


BIST 


Built-in self-test is part of the design-for-testability philosophy. It is applicable at the box, board, or
chip level. It defines the inclusion of additional circuitry or code specifically for testing purposes. It
may include software components, or diagnostic cores for FPGAs. For example, when a standard PC
system is reset, the initial code, in the BIOS, performs a series of functional tests on the board
hardware. This is generally referred to as POST — Power-On-Self-Test. The technique of BIST
was first used operationally on the Minuteman missile, and is still very applicable. Essentially, RHS is
BIST.


Self Monitoring Systems 


Homeostasis refers to a system that monitors, corrects, and controls its own state. Our bodies do that 
with our blood pressure, temperature, blood sugar level, and many other parameters. 


In at least one case I know of, the backup flight computer erroneously thought the primary machine 
made a mistake, and took over control. It was wrong, and caused a system failure. 


To counter the effects of “bit flips” and other effects of radiation, the memory can be designed with
error detection and correction (EDAC). Generally, this means a longer, encoded word that can detect N
and correct M errors. There is a trade-off with price. With EDAC memory, there is a low priority
background task running on the cpu that continuously reads memory and writes it back. This
process, called “memory scrubbing,” will catch and correct errors.
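
A scrubbing pass can be sketched in software. To keep the example self-contained, it swaps in a
simpler protection scheme than an EDAC code: each byte is stored three times and corrected by a
bitwise majority vote. That substitution, and the function name scrub, are assumptions for
illustration only.

```python
def scrub(copies):
    """One scrubbing pass over three redundant copies of a memory region.

    copies: a sequence of three equal-length bytearrays. Each byte is replaced
    in all three copies by the bitwise majority value. Returns the number of
    byte locations where the copies disagreed.
    """
    a, b, c = copies
    corrected = 0
    for i in range(len(a)):
        voted = (a[i] & b[i]) | (a[i] & c[i]) | (b[i] & c[i])
        if not (a[i] == b[i] == c[i]):
            corrected += 1
        a[i] = b[i] = c[i] = voted      # write back, repairing any flipped bits
    return corrected
```

Running the pass regularly matters because it removes single errors before a second upset can land
in the same location and defeat the correction.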


Self-test software can run as a background task. This can also send a “heart-beat” signal to another
processor or to monitoring logic. If we are using configurable logic, we can include built-in
self test (BIST) there as well.


A special-purpose timer essential for embedded applications is the Watchdog. This is a free-running 
timer that generates a cpu reset unless it is reset by the software. This helps to ensure that the system 
doesn’t lock up during certain critical time periods, and the software is meeting its deadlines. This 
approach has saved many a system. 


If the watchdog is not reset, it generates an interrupt to reset the host. This should take the system back 
to a baseline state, and restart it. Hopefully, normal operations will resume. The system can’t rely on a 
human operator to notice a fault in the operations or a “hung” system, and press the reset button. Many 
very remote systems, such as those in deep water or on the surface of other planets have successfully 
recovered from faults with a watchdog. 


The watchdog timer is implemented in hardware, and does its job without direct software intervention.


If the software fails to reset the timer, the system reboots. This might simply restart operations or may 
include diagnostics before or after the system is restarted. So, who watches the watchdog? 
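
The watchdog behavior described above can be modeled in a few lines. This is a software simulation
only: a real watchdog is a hardware peripheral, and the class name, the tick-based timing, and the
on_reset callback are assumptions made for the sketch.

```python
class WatchdogTimer:
    """Simulated watchdog: calls on_reset if not kicked within timeout_ticks."""

    def __init__(self, timeout_ticks, on_reset):
        self.timeout_ticks = timeout_ticks
        self.on_reset = on_reset
        self.remaining = timeout_ticks

    def kick(self):
        """Called by healthy software to restart the countdown."""
        self.remaining = self.timeout_ticks

    def tick(self):
        """Called once per timer period; stands in for the free-running counter."""
        self.remaining -= 1
        if self.remaining <= 0:
            self.on_reset()                     # stands in for the cpu reset line
            self.remaining = self.timeout_ticks
```

As long as the software calls kick() more often than every timeout_ticks periods, on_reset is never
invoked; if the software hangs and the kicks stop, the reset fires on its own.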


The bottom line is, we need to have a good, current, and correct characterization of the system and the 
environment it will be operating in. We postulate faults and how we can react to them, and minimize 
their damage. This won't be possible in all cases. We will also get some indications from the test phase 
of the project on how things failed, and, ideally, why. Most of the time, we are restricted in our 
mitigation approach to selecting redundant systems, or doing a re-boot. The important thing is, we save 
as much data as we can for later analysis. Do a Fault Case Study and share it. Then go on to address new
and more obscure failure modes.


Elements of RHS 


The RHS has many diverse pieces, and is not just one software module, but can be dispersed. Some of 
the RHS modules run continuously and some are triggered on demand, due to a specific event. It is 
desirable to have as much fault/failure coverage as possible, while minimizing the impact on the host's 
memory and timing. 


You're way ahead when you have some idea what is likely to fail, derived from testing, industry 
reports, and case studies. Fault coverage has to be as complete as possible, but we should ensure we 
have the known failure modes covered. Of course, some failures were missed in testing, resulting in 
their presence becoming known later in the operational environment. 


It is also critically important to know exactly what software has been loaded into the flight computer.
What if you have multiple copies, and don't know which one is in orbit? Configuration Control prevents
that, right? It has happened.


There is also a general policy of “test what you fly, fly what you test.” You might have included
diagnostic code for integration testing, and pull it out before flight. Wrong. Now the code you are
going to fly is untested. The tested version includes the instrumentation code. Even though it will never
be used, it takes up some space, so cache footprints, memory boundaries, and pipeline contents can be
different.


We also need to carefully consider the failure recovery. Sometimes, we will need the system to reboot 
itself. That's disruptive, but necessary in some cases. We want to take every possible path before going 
down that one. 


CPU failures are fairly rare, but the flight computer is operating in a hostile environment. There are
known failure modes in this environment that have to be covered. Failures will be transient or hard.
Sometimes, hard failures result in a state that is not recoverable. Transient failures, on the other hand,
are the hardest to find. We can observe the results, and try to work backward to the root cause. That is
where good up-front analysis and data from system test are invaluable. Some architectures, such as the
ARM Cortex-R7, have built-in hardware failure detection. That's a good approach, but it leaves many
potential failures uncovered.


We will discuss some potential RHS modules. There is some overlap and duplication in functionality.
The impact on system performance should be determined during test. Some of these modules address
other areas beyond radiation damage. Just because the cpu is in a high radiation environment doesn't
mean a spurious interrupt can't occur.


We can tap industry best practices code for system testing. We can also use testing code developed for
system POST (power-on-self-test) as an example. POST is accomplished after a reset, but before the
system begins to run operational code. It does allow for checking internal functionality. POST should
certainly be included in our repertoire. POST doesn't have specific run time requirements (except the
annoyance threshold). A large block of memory can be tested in sections, during operations, to avoid
adversely affecting system timing.


Other functions to be included in RHS include:


1. Current monitoring, used to check on potential radiation damage. This is the primary way to catch
radiation damage, as the telltale is gradually increased current consumption. The data should be
logged, and once past red limits, do a reset. The sampling rate can be on the order of hours.


2. Self-diagnosis — this routine tests the functionality of the cpu and I/O, but not memory, which is 
covered in another module. The cpu test is a functional test, applying operations and checking 
for a predefined result. At the same time, we are verifying that the instruction pipeline, the 
cache, the registers, and the ALU are working. If this test fails, the computer must be considered 
non-functional. 


3. Spurious interrupts test — It is critical that the entire interrupt vector table is filled, with unused
interrupts vectoring to an error routine. This routine, when invoked, will log the error, and
return. This is a zero or low overhead approach. There's not much more we can do, as the interrupts
are hard-wired to particular sources (or generated by software, in the case of attempted division by zero).


4. Memory test continuously running as a low priority task. This routine checks non-used memory
with a write-readback. It marks areas of memory (address ranges) that have a problem. For code
and constants memory, we can do an XOR operation across all the contents and compare the result
with a precomputed checksum. Memory testing algorithms are widely available (moving ones, moving zeros,
walking bits, cross-talk, etc), but we must consider that what we choose will be running in an
operational system, not a system-under-test in a lab, and plan for minimal impact on operations.


5. Checksum over code is a routine that calculates a checksum over the code region and compares 
it with the known good value. If there is a mis-match, it reloads code space from a backup 
location, after checking the checksum of the backup. This routine can be run continuously as a 
low priority task. 


6. Data corruption error testing — we can monitor fixed values in scattered locations in memory for
change, which would indicate data corruption in that area. These areas could be handled by
adjusting the memory management tables to map out that region. We would assume we lost
more than just our test word.


7. Memory scrub is a low priority task that reads and rewrites each memory location. It is like
DRAM refresh, without the hardware to do it in the background. This technique will sometimes
fix pending memory errors. This is less a diagnostic than a “wellness” routine.


8. I/O functionality test — This requires external hardware, such as loop-backs, to accomplish. It is
not feasible to test hardware links that are dedicated, and in use. For the unused links, like a
spare serial channel, we can provide for loopback at the board level. A failure in this case casts
suspicion on the entire I/O unit, and further examinations are needed.


9. Peripherals test. This is very specific to the device being tested, and interferes with operations.
The good news is, some devices can run self-diagnostics, and report back results. This class of
test can be run on redundant units (say, the 4th gyro) in a non-interference manner. We can also
power up and test specific spare units, and then swap them in and test the primary.


10. Watchdog timer — if not reset by the main code, it will reset the system. Here we implement a 
simple piece of software that has the lowest priority, but is guaranteed to run. It does nothing 
but reset the watchdog. Failing to do that will result in a system reset. 


We can use a similar technique for code-looping monitoring: we can start a watchdog timer before 
entering a long and complex loop to preclude a runaway situation. This involves more analysis of the 
big picture. 
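
Both uses of the watchdog can be sketched with a simulated timer. Everything named here 
(watchdog_t, wdt_kick, run_guarded, countdown_step) is hypothetical; on real hardware the tick comes 
from a timer interrupt and expiry forces a reset rather than setting a flag. 

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t timeout_ticks;  /* ticks allowed between kicks */
    uint32_t remaining;      /* ticks left before expiry */
    bool     expired;        /* latched; real hardware would reset instead */
} watchdog_t;

void wdt_start(watchdog_t *w, uint32_t timeout_ticks)
{
    w->timeout_ticks = timeout_ticks;
    w->remaining = timeout_ticks;
    w->expired = false;
}

/* Called by the lowest priority task: proves the scheduler is still alive. */
void wdt_kick(watchdog_t *w)
{
    if (!w->expired)
        w->remaining = w->timeout_ticks;
}

/* Stands in for the hardware timer tick. */
void wdt_tick(watchdog_t *w)
{
    if (w->expired)
        return;
    if (w->remaining > 0)
        w->remaining--;
    if (w->remaining == 0)
        w->expired = true;
}

/* Loop guard: run `step` until it reports completion, or give up when the
 * tick budget runs out. Returns true only if the loop finished by itself. */
bool run_guarded(watchdog_t *w, uint32_t budget,
                 bool (*step)(void *), void *ctx)
{
    wdt_start(w, budget);
    while (!w->expired) {
        if (step(ctx))
            return true;
        wdt_tick(w);   /* simulation only: one tick per iteration */
    }
    return false;
}

/* Example step: counts a uint32_t down to zero. */
bool countdown_step(void *ctx)
{
    uint32_t *n = ctx;
    if (*n > 0)
        (*n)--;
    return *n == 0;
}
```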


11. Stack overflow/underflow monitor. This routine monitors the contents of memory above and 
below the bounds of the stack for corruption. This condition might mean we have pointer errors, 
or some program is erroneously writing to the wrong memory area. There are also off-the-shelf 
routines for runtime stack monitoring. 
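
One simple form of this monitor fills guard words on either side of the stack with a known pattern 
and checks them periodically. The layout, the guard size, and the pattern below are assumptions. 
Which guard signals overflow and which signals underflow depends on the direction in which the 
stack grows on the target. 

```c
#include <stddef.h>
#include <stdint.h>

#define GUARD_WORDS   4u
#define GUARD_PATTERN 0xDEADBEEFu   /* assumed fill value */

enum { STACK_OK = 0, STACK_LOW_GUARD_HIT = 1, STACK_HIGH_GUARD_HIT = 2 };

/* Assumed layout: [low guard | usable stack | high guard] in one region. */
void stack_guards_init(volatile uint32_t *region, size_t total_words)
{
    if (total_words < 2u * GUARD_WORDS)
        return;
    for (size_t i = 0; i < GUARD_WORDS; i++) {
        region[i] = GUARD_PATTERN;
        region[total_words - 1u - i] = GUARD_PATTERN;
    }
}

/* Returns a bitmask of the guards that no longer hold the pattern. */
int stack_guards_check(const volatile uint32_t *region, size_t total_words)
{
    int status = STACK_OK;
    if (total_words < 2u * GUARD_WORDS)
        return STACK_LOW_GUARD_HIT | STACK_HIGH_GUARD_HIT;
    for (size_t i = 0; i < GUARD_WORDS; i++) {
        if (region[i] != GUARD_PATTERN)
            status |= STACK_LOW_GUARD_HIT;
        if (region[total_words - 1u - i] != GUARD_PATTERN)
            status |= STACK_HIGH_GUARD_HIT;
    }
    return status;
}
```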


12. Active hardware testing — this can be implemented in the lowest priority module. It doesn't even 
need to be executed every cycle, as long as it gets to run every day, for example. It would be 
good to be able to change the priority of this task. 


One technique is to deliberately generate error conditions such as attempted division by zero, and 
verify that the system has transferred to the correct interrupt handler. 


We should, now and then, test all registers for functionality. A common approach is to shift bits in the 
register, and check against a pre-computed value. The routine should save register contents before, and 
restore them after execution. 
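
The shift test can be illustrated in C as below. Portable C cannot name a particular CPU register, 
so this sketch walks a single bit through a volatile 32-bit location instead; a flight version would 
likely be written in assembly for each register. The function name and the 32-bit width are 
assumptions. 

```c
#include <stdbool.h>
#include <stdint.h>

/* Walk a single 1 bit through all 32 positions, checking each step against
 * the expected value, then restore the original contents. */
bool walking_one_test(volatile uint32_t *reg)
{
    const uint32_t saved = *reg;     /* save contents before the test */
    uint32_t expect = 1u;
    uint32_t seen = 0u;
    bool ok = true;

    *reg = 1u;
    for (int i = 0; i < 32; i++) {
        uint32_t v = *reg;
        if (v != expect)
            ok = false;              /* stuck or bridged bit */
        seen |= v;
        *reg = v << 1;
        expect <<= 1;
    }
    if (*reg != 0u)
        ok = false;                  /* the bit must fall off the top */
    if (seen != 0xFFFFFFFFu)
        ok = false;                  /* pre-computed: every bit was seen */

    *reg = saved;                    /* restore contents after the test */
    return ok;
}
```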


13. Temperature monitoring uses the CPU's internal temperature monitor to check for out-of-spec 
temperatures. Overheating is not directly related to radiation damage, but it can also cause sudden 
system failure. Sometimes, exceeding temperature operating specifications can lead to strange timing 
behavior, and failure. 


One incident I am familiar with involves NASA's Advanced On-board Processor 
(AOP), which was flown on the International Ultraviolet Explorer (IUE) spacecraft in 1978. During the 
mission, there were many strange computer crashes, leading to loss of attitude control. Nothing like this 
had ever happened in test. 


A memory dump analysis showed corrupted interrupt vectors. This was the first big clue as to what was 
happening. The temperature specification for the onboard electronics was 38 degrees C. The observed 
post-launch temperature was 52 to 55 C. This was suspected to be the problem. 


There was no ground-based facility available that could duplicate the OBC temperature environment 
seen in flight. However, software to provide protection against “hits” was completed 2 months 
post-launch. 


About 1.5 years after launch, there was a detailed FMEA of the cause of the problem. Analysis revealed a 
fault condition after an ADD of two values, both close to zero, and both negative. An interrupt causes a 
context switch. When the adder circuit was hot, it took longer to settle. Since the adder was also used to 
generate the target address (there was no separate program counter incrementer), sometimes, with 
previous negative numbers, bit 15 was not cleared. This resulted in an incorrect interrupt jump target 
vector. The root cause, then, was operating outside the specified thermal environment. A diagnostic 
patch to fix the problem was finally uploaded in January 1980. 


An interesting thing to watch out for is that your optimizing compiler, which does not know the intent 
of your code, might optimize away test routines it deems silly. Watch the optimization carefully, and 
adjust accordingly. 
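
In C, one common guard is the volatile qualifier, which obliges the compiler to perform every read 
and write as written. The sketch below is an illustration with a hypothetical pattern_test routine; 
without volatile, a compiler could legally drop the intermediate stores and loads and reduce the 
function to returning true. 

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Write each pattern to the cell and read it back. The `volatile` qualifier
 * keeps the optimizer from eliding the accesses that make up the test. */
bool pattern_test(volatile uint32_t *cell)
{
    static const uint32_t patterns[] = {
        0x00000000u, 0xFFFFFFFFu, 0xAAAAAAAAu, 0x55555555u
    };
    const uint32_t saved = *cell;
    bool ok = true;

    for (size_t i = 0; i < sizeof patterns / sizeof patterns[0]; i++) {
        *cell = patterns[i];
        if (*cell != patterns[i])
            ok = false;
    }
    *cell = saved;   /* leave the cell as we found it */
    return ok;
}
```

It is still worth inspecting the generated assembly for test routines, since volatile covers only 
the accesses made through the qualified pointer. 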


I'll mention a quick story in debugging and repair from the days when it was a lot harder. In the 1970s, 
the OAO spacecraft used the NASA On-Board Processor (OBP), an 18-bit machine with core memory. It 
got the job done. It started having strange behavior some years into the program. We tried to find 
out what was happening by memory dumps and using our bench model. Really random behavior was 
observed. Some data were corrupt, some arithmetic results were wrong, sometimes the program 
branched off...somewhere. 


To save a chip or two, the hardware designers implemented the address update by passing it through the 
ALU. What could possibly go wrong? Well, bit 7 in the ALU could get stuck. 


End of mission? Well, no. A friend of mine re-wrote all of the flight code (using our standard tools of a 
sharp #2 pencil and lined pad) to NOT use bit 6. Think about that. 


Wrap up 


Since rad hard hardware is not only expensive, but lags in performance compared to commercial grade 
parts, there is a need for a viable work-around. Software, the ideal space component since it doesn't 
weigh anything, is part of the answer. Quite a few things can be tested by the software running on the 
test target. Combined with good engineering analysis, and knowledge of potential failure modes, this 
can go a long way towards increasing the on-orbit reliability of lower cost missions. 


The next step is to create a library of RHS routines. 